In this notebook we are going to use gradient descent to estimate the parameters of a model. In this case we are going to compute the parameters needed to convert temperatures from Fahrenheit to Kelvin.
Since we are approaching this from a machine learning perspective, we will determine our scaling factor and offset value by gradient descent, using some example temperatures on both scales. In other words, we are going to learn the parameters from the data.
First, some imports:
In [1]:
import numpy as np
import pandas as pd
Our data set:
In [2]:
INDEX = ['Boiling point of He',
'Boiling point of N',
'Melting point of H2O',
'Body temperature',
'Boiling point of H2O']
X = np.array([-452.1, -320.4, 32.0, 98.6, 212.0])
Y = np.array([4.22, 77.36, 273.2, 310.5, 373.2])
Show our data set in a table:
In [3]:
pd.DataFrame(np.stack([X, Y]).T, index=INDEX,
columns=['Fahrenheit ($x$)', 'Kelvin ($y$)'])
Out[3]:
Temperatures can be converted using a linear model of the form $y=ax+b$.
$x$ and $y$ are samples in our dataset; $X=\{x_0 \dots x_N\}$ and $Y=\{y_0 \dots y_N\}$, while $a$ and $b$ are the model parameters.
Let's try initialising the parameter $a$ randomly and $b$ to 0 and see what it predicts:
In [4]:
# Let's initialise `a` to a value between 1.0 and 2.0; it is therefore impossible
# for it to start at a (nearly) correct value, forcing our model to do some work.
a = np.random.uniform(1.0, 2.0, size=())
b = 0.0
print('a={}, b={}'.format(a, b))
Y_pred = X * a + b
pd.DataFrame(np.stack([X, Y, Y_pred]).T, index=INDEX,
columns=['Fahrenheit ($x$)', 'Kelvin ($y$)', '$y_{pred}$'])
Out[4]:
In [5]:
sqr_err = (Y_pred - Y)**2
pd.DataFrame(np.stack([X, Y, Y_pred, sqr_err]).T, index=INDEX,
columns=['Fahrenheit ($x$)', 'Kelvin ($y$)', '$y_{pred}$', r'squared err ($\epsilon$)'])
Out[5]:
We reduce the error by taking the gradient of the squared error with respect to the parameters $a$ and $b$ and iteratively modifying the values of $a$ and $b$ in the direction of the negated gradient.
Let's determine the expressions for the gradient of the squared error $\epsilon$ with respect to $a$ and $b$:
$\epsilon_i = (ax_i + b - y_i)^2 = a^2x_i^2 + 2abx_i - 2ax_iy_i + b^2 + y_i^2 - 2by_i$
In terms of $a$: $\epsilon_i = a^2x_i^2 + a(2bx_i - 2x_iy_i) + b^2 + y_i^2 - 2by_i$
So ${d\epsilon_i\over{da}} = 2ax_i^2 + 2bx_i - 2x_iy_i$
In terms of $b$: $\epsilon_i = b^2 + b(2ax_i - 2y_i) + a^2x_i^2 - 2ax_iy_i + y_i^2$
So ${d\epsilon_i\over{db}} = 2b + 2ax_i - 2y_i$
The above expressions apply to a single sample only. To apply them to all five of our data points, we use the mean squared error: the sum of the individual errors divided by the number of data points $N$. The derivative of the mean squared error w.r.t. $a$ and $b$ is likewise the sum of the individual derivatives divided by $N$.
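Writing $\bar\epsilon$ for the mean squared error (the only new symbol introduced here), this gives:
$\bar\epsilon = {1\over{N}}\sum_i \epsilon_i$
${d\bar\epsilon\over{da}} = {1\over{N}}\sum_i \left(2ax_i^2 + 2bx_i - 2x_iy_i\right)$
${d\bar\epsilon\over{db}} = {1\over{N}}\sum_i \left(2b + 2ax_i - 2y_i\right)$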
Gradient descent is performed iteratively; each parameter is updated independently as follows:
$a' = a - \gamma {d\bar\epsilon\over{da}}$
$b' = b - \gamma {d\bar\epsilon\over{db}}$
where $\gamma$ is the learning rate.
We now have all we need to define some gradient descent helper functions:
In [6]:
def iterative_gradient_descent_step(a, b, lr):
    """
    A single gradient descent iteration
    :param a: current value of `a`
    :param b: current value of `b`
    :param lr: learning rate
    :return: a tuple `(a_next, b_next)` that are the values of `a` and `b` after the iteration.
    """
    # Derivatives of the mean squared error epsilon w.r.t. a and b:
    depsilon_da = (2 * a * X**2 + 2 * b * X - 2 * X * Y).mean()
    depsilon_db = (2 * b + 2 * a * X - 2 * Y).mean()
    # Gradient descent: step in the direction of the negated gradient
    a = a - depsilon_da * lr
    b = b - depsilon_db * lr
    # Return new values
    return a, b

def state_as_table(a, b):
    """
    Helper function to generate a Pandas DataFrame showing the current state, including predicted values and errors
    :param a: current value of `a`
    :param b: current value of `b`
    :return: tuple `(df, mean_sqr_err)` where `df` is the Pandas DataFrame and `mean_sqr_err` is the mean squared error
    """
    Y_pred = X * a + b
    sqr_err = (Y_pred - Y)**2
    df = pd.DataFrame(np.stack([X, Y, Y_pred, sqr_err]).T, index=INDEX,
                      columns=['Fahrenheit ($x$)', 'Kelvin ($y$)', '$y_{pred}$', r'squared err ($\epsilon$)'])
    return df, sqr_err.mean()
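As a quick sanity check (not part of the original notebook), we can compare the analytic derivative expressions derived above against central finite differences of the mean squared error; `mse` and `finite_diff_grads` below are helpers introduced purely for this check.
In [ ]:
def mse(a, b):
    """Mean squared error of the linear model for the given parameters."""
    return (((X * a + b) - Y)**2).mean()

def finite_diff_grads(a, b, eps=1e-6):
    """Central finite-difference approximation of d(mse)/da and d(mse)/db."""
    d_da = (mse(a + eps, b) - mse(a - eps, b)) / (2 * eps)
    d_db = (mse(a, b + eps) - mse(a, b - eps)) / (2 * eps)
    return d_da, d_db

# Analytic expressions derived above; the two pairs should agree closely
analytic_da = (2 * a * X**2 + 2 * b * X - 2 * X * Y).mean()
analytic_db = (2 * b + 2 * a * X - 2 * Y).mean()
print('finite diff:', finite_diff_grads(a, b))
print('analytic:   ', (analytic_da, analytic_db))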
Define learning rate and show initial state:
In [7]:
LEARNING_RATE = 0.00001
N_ITERATIONS = 50000
df, mean_sqr_err = state_as_table(a, b)
print('a = {}, b = {}, mean sqr. err. = {}'.format(a, b, mean_sqr_err))
df
Out[7]:
In [31]:
for i in range(N_ITERATIONS):
    a, b = iterative_gradient_descent_step(a, b, LEARNING_RATE)
df, mean_sqr_err = state_as_table(a, b)
print('a = {}, b = {}, mean sqr. err. = {}'.format(a, b, mean_sqr_err))
df
Out[31]:
The formula for conversion from Fahrenheit to Kelvin is:
$T_K = {5\over9}T_F + 255.372$
Therefore:
$a = 0.556$
$b = 255.372$
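For reference, these values follow from the standard two-step conversion through Celsius (a known identity, not derived in the notebook):
$T_K = {5\over9}(T_F - 32) + 273.15 = {5\over9}T_F + \left(273.15 - {160\over9}\right) \approx {5\over9}T_F + 255.372$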
If the above cell is run enough times, $a$ and $b$ should reach values close to the ideal values above (some error is expected, as the input data contains small rounding errors).
There are some problems with the approach above: a very low learning rate and a huge number of iterations were required. This is because a larger learning rate causes the parameters to take huge steps in one direction or another, often making them oscillate between negative and positive values with rapidly increasing magnitudes, after which the model 'explodes'.
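A rough way to see why (a standard result for quadratic losses, not worked through in the original notebook): gradient descent on a quadratic diverges once the learning rate exceeds $2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest curvature of the loss. Here the curvature in the $a$ direction is ${d^2\bar\epsilon\over{da^2}} = 2\overline{x^2}$, and with Fahrenheit values in the hundreds, $\overline{x^2}$ is of order $10^4$ to $10^5$, so $\gamma$ has to be of order $10^{-5}$ or smaller.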
This could be addressed by standardising the data in $X$ and $Y$, i.e. subtracting the mean and dividing by the standard deviation. This would allow a much higher learning rate and a much smaller number of iterations to suffice.
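As a minimal sketch of that approach (not in the original notebook; the learning rate of 0.1, the 500 iterations and the variable names are illustrative assumptions), we could fit on standardised data and then map the parameters back to the original scale:
In [ ]:
X_mean, X_std = X.mean(), X.std()
Y_mean, Y_std = Y.mean(), Y.std()
Xs = (X - X_mean) / X_std
Ys = (Y - Y_mean) / Y_std

a_s, b_s = 1.5, 0.0
for _ in range(500):
    # Same derivative expressions as above, but on the standardised data
    da = (2 * a_s * Xs**2 + 2 * b_s * Xs - 2 * Xs * Ys).mean()
    db = (2 * b_s + 2 * a_s * Xs - 2 * Ys).mean()
    a_s, b_s = a_s - 0.1 * da, b_s - 0.1 * db

# Undo the standardisation so that y = a*x + b holds on the original scale
a_rec = a_s * Y_std / X_std
b_rec = Y_mean + Y_std * (b_s - a_s * X_mean / X_std)
print('a = {}, b = {}'.format(a_rec, b_rec))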
That said, this notebook demonstrates the use of gradient descent to train a simple linear regression model; I hope you have found it helpful.